We listen to music in numerous ways: emotionally, analytically (for production and instrumental technique), and narratively (for a story). Given all of these factors, what makes a good song? This project will try to predict music popularity, using the Recording Academy's Grammy Awards and Billboard Top Charts historical data, based on Spotify audio features (AFTS). As detailed below, and further in the codebook, AFTS capture the different ways we listen to music. We will run (1) a classification analysis on the Grammy data and (2) a regression analysis on the Billboard data.
Note: See the Data Collection section for a more detailed explanation of how these records were extracted. The Recording Academy's Grammy Awards are the most esteemed music awards and the only peer-recognized awards in the industry to date. Collected for this project are the historical "Song of the Year" nominations and winners from 1960-2021. The Billboard charts are another measure of music popularity, based on sales, radio play, and streaming. In this data set, we have the historical annual Year-End Top 100 charts from 1960-2021. We will try to predict two things: (1) what levels of AFTS are associated with Grammy nominations and wins for Song of the Year, and (2) what levels of AFTS are associated with Billboard position (1 to 100, 1 being the most popular). Between these two measurements, the Grammy and Billboard data, we capture both a high-level critical view and the public consensus on music popularity.
See the most recent Grammy Awards nominations/winners (2021).
Note: See the Data Collection section for a more detailed explanation of how these records were extracted. The Spotify API allows you to extract several audio features for each song in its database. These values are based on a machine-learning audio analysis process. For this project we will use the following selected AFTS: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence. For a detailed description of each of these variables, please refer to the codebook in my zipped files and/or this link. Together, these characteristics capture the ways we listen to music (analytically, narratively, emotionally). Please visit the Spotify Developers documentation to learn more about the Spotify API.
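As a rough illustration of the extraction step (not the project's actual collect.py), the Spotify Web API's audio-features endpoint returns a JSON object per track containing these fields alongside metadata. The helper below simply filters one such response object down to the nine features used in this project; token handling and the HTTP call itself are omitted, and the sample response values are illustrative.

```python
# Sketch: keeping only the nine selected AFTS from a Spotify
# audio-features response (GET https://api.spotify.com/v1/audio-features/{id}).
# Fetching requires an OAuth token; only the filtering step is shown here.

AFTS = ["acousticness", "danceability", "energy", "instrumentalness",
        "liveness", "loudness", "speechiness", "tempo", "valence"]

def extract_afts(feature_obj):
    """Keep only the selected audio features from one API response object."""
    return {k: feature_obj[k] for k in AFTS if k in feature_obj}

# Example with a truncated, illustrative response object:
resp = {"danceability": 0.326, "energy": 0.326, "loudness": -15.144,
        "speechiness": 0.0297, "acousticness": 0.591,
        "instrumentalness": 0.918, "liveness": 0.494, "valence": 0.870,
        "tempo": 186.232, "type": "audio_features", "id": "..."}
print(extract_afts(resp))
```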
The data collection for this project required a couple of steps. The first was to create a data frame that combined each year of Grammy nominations/winners and Billboard rankings. I found annual CSVs for each year from 1960 to 2021 sourced through Wikipedia. For the Grammy data I manually added the win variable and year. For the Billboard data, position was already a column, so I only added year. At this point I had two data sets: one for the Grammy data and one for the Billboard data. I then wrote a Python function that merged these two data sets; please see the manage.py file in my zipped files for details. This matched or appended the Grammy data to the Billboard data, so that each row had a song name, artist name, position ranking, and Grammy status. Songs that were nominated but did not appear on the Billboard charts were assigned a ranking of 101. I also added a decade variable for EDA purposes. In this preliminary data set, there were around 6,300 observations. Next, I created a Python app that connected to the Spotify API. This collection process took a couple of weeks due to run time and the number of calls; my computer did not enjoy processing everything. Please see the collect.py file in my zipped files. Essentially, this app took each row's song name and artist name from the preliminary data set and queried the Spotify database for AFTS; if the song was found, the AFTS values were filled into the respective columns. Since some names were spelled incorrectly on Wikipedia, I had to double-check this data frame. In this final step, I manually checked missing data and filled in AFTS where I could. However, there were still about twenty songs that could not be found on Spotify. I decided to drop these observations because AFTS are essential.
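The matching step can be sketched in plain Python. This is a simplified stand-in for the actual manage.py, not its real code: the function name and field names are illustrative, and the numeric win codes (0 = none, 2 = nominated, 3 = won) follow the factor releveling used later in the analysis.

```python
# Simplified sketch of the merge logic described above (names illustrative).
# Billboard rows carry `position`; Grammy rows carry a `win` code.
# Nominated songs absent from the chart are appended with position 101.

def merge_year(billboard_rows, grammy_rows, year):
    key = lambda r: (r["songname"].lower(), r["artistname"].lower())
    pending = {key(r): r for r in grammy_rows}  # Grammy songs not yet matched
    merged = []
    for row in billboard_rows:
        g = pending.pop(key(row), None)         # match on (song, artist)
        merged.append({**row, "year": year, "win": g["win"] if g else 0})
    for g in pending.values():                  # nominated but never charted
        merged.append({"songname": g["songname"], "artistname": g["artistname"],
                       "year": year, "position": 101, "win": g["win"]})
    return merged
```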
dim(data)
## [1] 6291 15
There are 6291 rows and 15 columns. Each observation is a song.
A brief description of the variables is as follows:

- songname: name of the song
- artistname: artist name(s) of the song
- year: year of rating/award
- decade: decade of rating/award
- position: position on the Billboard chart (1 to 100, 1 being best; 101 if not on the chart)
- win: indicator for Grammy Song of the Year distinction (0 = none, 2 = nominated, 3 = won)
- audio feature variables (AFTS): Spotify-rated audio features; see the codebook for specifics
  - acousticness: how acoustic the song is (0.00-1.00, 1.00 being most acoustic)
  - danceability: how suitable a song is to dance to (0.00-1.00, 1.00 being more danceable)
  - energy: how intense, noisy, loud, and fast a song is (0.00-1.00, 1.00 being most energetic)
  - instrumentalness: predicts whether a track contains mostly instruments and no vocals (0.00-1.00, 1.00 being mostly instrumental)
  - liveness: detects the presence of a live audience in the recording (0.00-1.00, higher values suggesting a live performance)
  - loudness: the average loudness of the song, measured in decibels
  - speechiness: how spoken the track is (0.00-1.00, 1.00 being mostly spoken)
  - tempo: the estimated average tempo of the song, in beats per minute
  - valence: how happy/positive the song sounds (0.00-1.00, 1.00 being most positive)

See the codebook in my zipped files for more details on the audio features and the other variables.
Thanks to the collection process, the data frame is fairly tidy. Here we will convert nominal variables to factors, make any level adjustments, and clean up any other aspects of the data that will improve the efficiency of the analysis.
data$year <- as.factor(data$year)
data$win <- as.factor(data$win)
data$decade <- as.factor(data$decade)
levels(data$win) <- c("none", "nominated", "won")
levels(data$decade) <- c("60s", "70s", "80s", "90s", "2000s", "2010s", "2020s")
music_hist <- data
head(music_hist, n = 50) %>%
kable %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = '700px', height = "300px")
| songname | artistname | year | decade | position | win | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | 0.5910 | 0.326 | 0.3260 | 9.18e-01 | 0.4940 | -15.144 | 0.0297 | 186.232 | 0.870 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | 0.9090 | 0.554 | 0.1860 | 1.44e-03 | 0.1100 | -15.846 | 0.0379 | 81.181 | 0.200 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | 0.5480 | 0.507 | 0.5610 | 0.00e+00 | 0.5900 | -10.472 | 0.0353 | 120.042 | 0.890 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | 0.7750 | 0.760 | 0.4680 | 2.36e-05 | 0.1840 | -8.957 | 0.0482 | 119.986 | 0.745 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | 0.9360 | 0.569 | 0.0638 | 0.00e+00 | 0.1220 | -18.548 | 0.0458 | 101.521 | 0.282 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | 0.8680 | 0.558 | 0.2230 | 9.72e-04 | 0.1300 | -12.362 | 0.0299 | 101.711 | 0.303 |
| It’s Now or Never | Elvis Presley | 1960 | 60s | 7 | none | 0.6420 | 0.643 | 0.4910 | 9.72e-03 | 0.2860 | -9.312 | 0.0344 | 126.399 | 0.753 |
| Handy Man | Jimmy Jones | 1960 | 60s | 8 | none | 0.4000 | 0.477 | 0.7810 | 0.00e+00 | 0.3310 | -6.931 | 0.0654 | 144.976 | 0.750 |
| Stuck on You | Elvis Presley | 1960 | 60s | 9 | none | 0.7580 | 0.647 | 0.5130 | 8.60e-06 | 0.1080 | -12.372 | 0.0421 | 131.641 | 0.955 |
| The Twist | Chubby Checker | 1960 | 60s | 10 | none | 0.1800 | 0.571 | 0.6180 | 2.70e-06 | 0.1780 | -5.682 | 0.0371 | 156.892 | 0.852 |
| Everybody’s Somebody’s Fool | Connie Francis | 1960 | 60s | 11 | none | 0.7450 | 0.584 | 0.7370 | 1.93e-05 | 0.2080 | -4.643 | 0.0306 | 84.322 | 0.753 |
| Wild One | Bobby Rydell | 1960 | 60s | 12 | none | 0.6790 | 0.598 | 0.7140 | 3.00e-06 | 0.0539 | -6.981 | 0.0443 | 148.876 | 0.853 |
| Greenfields | The Brothers Four | 1960 | 60s | 13 | none | 0.8630 | 0.466 | 0.1250 | 0.00e+00 | 0.1210 | -20.428 | 0.0372 | 111.430 | 0.338 |
| What in the World’s Come Over You | Jack Scott | 1960 | 60s | 14 | none | 0.8020 | 0.506 | 0.1760 | 3.90e-06 | 0.0978 | -13.962 | 0.0303 | 76.273 | 0.492 |
| El Paso | Marty Robbins | 1960 | 60s | 15 | none | 0.8350 | 0.654 | 0.4520 | 2.89e-05 | 0.1600 | -9.709 | 0.0300 | 106.662 | 0.691 |
| Alley Oop | The Hollywood Argyles | 1960 | 60s | 16 | none | 0.8090 | 0.584 | 0.4760 | 0.00e+00 | 0.0915 | -9.562 | 0.0614 | 63.325 | 0.944 |
| My Heart Has a Mind of Its Own | Connie Francis | 1960 | 60s | 17 | none | 0.8190 | 0.495 | 0.5390 | 0.00e+00 | 0.3250 | -6.088 | 0.0336 | 109.490 | 0.715 |
| Sweet Nothin’s | Brenda Lee | 1960 | 60s | 18 | none | 0.6420 | 0.778 | 0.4130 | 1.04e-03 | 0.1480 | -10.551 | 0.0514 | 125.235 | 0.961 |
| Itsy Bitsy Teenie Weenie Yellow Polka Dot Bikini | Brian Hyland | 1960 | 60s | 19 | none | 0.5580 | 0.814 | 0.4270 | 0.00e+00 | 0.0248 | -11.543 | 0.0918 | 123.109 | 0.964 |
| Only the Lonely | Roy Orbison | 1960 | 60s | 20 | none | 0.3770 | 0.570 | 0.5290 | 5.09e-03 | 0.2030 | -10.769 | 0.0280 | 123.273 | 0.934 |
| Where or When | Dion and the Belmonts | 1960 | 60s | 21 | none | 0.7220 | 0.449 | 0.3950 | 0.00e+00 | 0.1830 | -6.389 | 0.0277 | 110.501 | 0.393 |
| Sixteen Reasons | Connie Stevens | 1960 | 60s | 22 | none | 0.8650 | 0.339 | 0.4060 | 1.10e-05 | 0.1100 | -8.955 | 0.0319 | 109.783 | 0.619 |
| Puppy Love | Paul Anka | 1960 | 60s | 23 | none | 0.6670 | 0.431 | 0.3210 | 0.00e+00 | 0.1920 | -11.827 | 0.0289 | 103.164 | 0.518 |
| Why | Frankie Avalon | 1960 | 60s | 24 | none | 0.7620 | 0.510 | 0.3490 | 0.00e+00 | 0.1300 | -8.677 | 0.0264 | 94.267 | 0.586 |
| Walk Don’t Run | The Ventures | 1960 | 60s | 25 | none | 0.8520 | 0.488 | 0.6480 | 9.14e-01 | 0.1300 | -13.252 | 0.0305 | 156.350 | 0.949 |
| Save the Last Dance for Me | The Drifters | 1960 | 60s | 26 | none | 0.6140 | 0.540 | 0.5300 | 0.00e+00 | 0.1980 | -10.583 | 0.0361 | 143.453 | 0.896 |
| Baby (You’ve Got What It Takes) | Dinah Washington | 1960 | 60s | 27 | none | 0.8520 | 0.670 | 0.5960 | 2.03e-03 | 0.6530 | -9.347 | 0.0627 | 133.396 | 0.813 |
| Sink the Bismarck | Johnny Horton | 1960 | 60s | 28 | none | 0.6520 | 0.680 | 0.5700 | 1.18e-05 | 0.0565 | -12.388 | 0.0958 | 115.894 | 0.966 |
| Chain Gang | Sam Cooke | 1960 | 60s | 29 | none | 0.7300 | 0.703 | 0.7240 | 0.00e+00 | 0.5180 | -10.818 | 0.0467 | 131.821 | 0.963 |
| Let It Be Me | The Everly Brothers | 1960 | 60s | 30 | none | 0.7700 | 0.471 | 0.1900 | 3.85e-03 | 0.1290 | -16.046 | 0.0280 | 72.764 | 0.305 |
| Good Timin’ | Jimmy Jones | 1960 | 60s | 31 | none | 0.6090 | 0.552 | 0.5620 | 1.25e-04 | 0.0930 | -7.682 | 0.0412 | 147.384 | 0.971 |
| Beyond the Sea | Bobby Darin | 1960 | 60s | 32 | none | 0.7230 | 0.521 | 0.5160 | 0.00e+00 | 0.2570 | -7.456 | 0.0369 | 136.483 | 0.569 |
| Go, Jimmy, Go | Jimmy Clanton | 1960 | 60s | 33 | none | 0.2420 | 0.487 | 0.6670 | 0.00e+00 | 0.1050 | -4.783 | 0.0358 | 137.492 | 0.782 |
| Night | Jackie Wilson | 1960 | 60s | 34 | none | 0.9180 | 0.284 | 0.4980 | 9.70e-04 | 0.3610 | -5.534 | 0.0296 | 99.075 | 0.327 |
| Burning Bridges | Jack Scott | 1960 | 60s | 35 | none | 0.8000 | 0.471 | 0.2500 | 3.23e-04 | 0.1240 | -14.637 | 0.0246 | 82.647 | 0.232 |
| The Big Hurt | Toni Fisher | 1960 | 60s | 36 | none | 0.8750 | 0.511 | 0.5240 | 0.00e+00 | 0.1220 | -14.652 | 0.0395 | 127.444 | 0.284 |
| Because They’re Young | Duane Eddy | 1960 | 60s | 37 | none | 0.0122 | 0.701 | 0.6230 | 7.43e-01 | 0.0858 | -10.508 | 0.0332 | 120.567 | 0.848 |
| Lonely Blue Boy | Conway Twitty | 1960 | 60s | 38 | none | 0.7770 | 0.659 | 0.3420 | 1.26e-05 | 0.2480 | -13.709 | 0.0353 | 112.878 | 0.677 |
| Pretty Blue Eyes | Steve Lawrence | 1960 | 60s | 39 | none | 0.4970 | 0.477 | 0.5070 | 0.00e+00 | 0.3600 | -11.802 | 0.0328 | 126.419 | 0.864 |
| Way Down Yonder in New Orleans | Freddy Cannon | 1960 | 60s | 40 | none | 0.7490 | 0.453 | 0.7140 | 2.00e-06 | 0.1240 | -9.083 | 0.0475 | 142.737 | 0.881 |
| Paper Roses | Anita Bryant | 1960 | 60s | 41 | none | 0.9020 | 0.350 | 0.1240 | 2.07e-05 | 0.1340 | -18.923 | 0.0339 | 117.169 | 0.320 |
| Mr. Custer | Larry Verne | 1960 | 60s | 42 | none | 0.7070 | 0.697 | 0.6970 | 8.72e-05 | 0.1560 | -4.863 | 0.0537 | 108.500 | 0.582 |
| I Want to Be Wanted | Brenda Lee | 1960 | 60s | 43 | none | 0.5930 | 0.515 | 0.4120 | 0.00e+00 | 0.2180 | -6.322 | 0.0277 | 107.610 | 0.392 |
| Mule Skinner Blues | The Fendermen | 1960 | 60s | 44 | none | 0.8140 | 0.510 | 0.5920 | 0.00e+00 | 0.0933 | -6.387 | 0.0384 | 128.880 | 0.545 |
| Cradle of Love | Johnny Preston | 1960 | 60s | 45 | none | 0.2720 | 0.450 | 0.5920 | 1.81e-02 | 0.0852 | -6.740 | 0.0760 | 173.986 | 0.796 |
| You Got What It Takes | Marv Johnson | 1960 | 60s | 46 | none | 0.7140 | 0.740 | 0.7050 | 2.82e-05 | 0.0956 | -7.994 | 0.0460 | 130.348 | 0.960 |
| Please Help Me, I’m Falling | Hank Locklin | 1960 | 60s | 47 | none | 0.8280 | 0.552 | 0.3360 | 0.00e+00 | 0.3240 | -10.712 | 0.0311 | 105.479 | 0.553 |
| Love You So | Ron Holden | 1960 | 60s | 48 | none | 0.8900 | 0.634 | 0.2750 | 1.11e-03 | 0.1370 | -11.417 | 0.0348 | 127.294 | 0.581 |
| Finger Poppin’ Time | Hank Ballard & The Midnighters | 1960 | 60s | 49 | none | 0.0735 | 0.533 | 0.7590 | 0.00e+00 | 0.3150 | -5.199 | 0.0356 | 159.373 | 0.889 |
| Harbor Lights | The Platters | 1960 | 60s | 50 | none | 0.9130 | 0.260 | 0.2900 | 6.49e-04 | 0.1530 | -13.380 | 0.0293 | 78.797 | 0.303 |
During the manual check of the data, I removed songs that were not in the Spotify database. This should have eliminated any missing values. Let's double-check for NAs; if any are present, we will drop those observations.
apply(data, 2, function(x) any(is.na(x)))
## songname artistname year decade
## FALSE FALSE FALSE FALSE
## position win acousticness danceability
## FALSE FALSE FALSE FALSE
## energy instrumentalness liveness loudness
## FALSE FALSE FALSE FALSE
## speechiness tempo valence
## FALSE FALSE FALSE
There are no missing values. We are ready to investigate how our variables interact with each other.
Let's pivot the data from wide to long so that the AFTS names fall under one variable, aft, with their values in rating. This will make it easier to build graphics during the EDA. Here are the first 50 rows of that reshaped data set.
#pivoted music_hist frame for further investigation
data_mod <- music_hist %>%
pivot_longer(
cols = c("acousticness", "danceability", "energy", "instrumentalness",
"liveness", "loudness", "speechiness", "tempo", "valence"),
names_to = "aft",
values_to = "rating"
)
head(data_mod, n = 50) %>%
kable %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = '700px', height = "300px")
| songname | artistname | year | decade | position | win | aft | rating |
|---|---|---|---|---|---|---|---|
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | acousticness | 0.5910000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | danceability | 0.3260000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | energy | 0.3260000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | instrumentalness | 0.9180000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | liveness | 0.4940000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | loudness | -15.1440000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | speechiness | 0.0297000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | tempo | 186.2320000 |
| Theme from A Summer Place | Percy Faith | 1960 | 60s | 1 | won | valence | 0.8700000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | acousticness | 0.9090000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | danceability | 0.5540000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | energy | 0.1860000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | instrumentalness | 0.0014400 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | liveness | 0.1100000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | loudness | -15.8460000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | speechiness | 0.0379000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | tempo | 81.1810000 |
| He’ll Have to Go | Jim Reeves | 1960 | 60s | 2 | none | valence | 0.2000000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | acousticness | 0.5480000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | danceability | 0.5070000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | energy | 0.5610000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | instrumentalness | 0.0000000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | liveness | 0.5900000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | loudness | -10.4720000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | speechiness | 0.0353000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | tempo | 120.0420000 |
| Cathy’s Clown | The Everly Brothers | 1960 | 60s | 3 | none | valence | 0.8900000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | acousticness | 0.7750000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | danceability | 0.7600000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | energy | 0.4680000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | instrumentalness | 0.0000236 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | liveness | 0.1840000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | loudness | -8.9570000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | speechiness | 0.0482000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | tempo | 119.9860000 |
| Running Bear | Johnny Preston | 1960 | 60s | 4 | none | valence | 0.7450000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | acousticness | 0.9360000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | danceability | 0.5690000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | energy | 0.0638000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | instrumentalness | 0.0000000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | liveness | 0.1220000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | loudness | -18.5480000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | speechiness | 0.0458000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | tempo | 101.5210000 |
| Teen Angel | Mark Dinning | 1960 | 60s | 5 | none | valence | 0.2820000 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | acousticness | 0.8680000 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | danceability | 0.5580000 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | energy | 0.2230000 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | instrumentalness | 0.0009720 |
| I’m Sorry | Brenda Lee | 1960 | 60s | 6 | none | liveness | 0.1300000 |
aft_avg <- data_mod %>% group_by(aft) %>%
summarise_at(vars(rating), list(rating = mean))
aft_avg %>%
kable %>%
kable_styling("striped", full_width = F)
| aft | rating |
|---|---|
| acousticness | 0.2646085 |
| danceability | 0.6259299 |
| energy | 0.6149965 |
| instrumentalness | 0.0374943 |
| liveness | 0.1759030 |
| loudness | -8.4695172 |
| speechiness | 0.0724138 |
| tempo | 119.4077783 |
| valence | 0.5988204 |
aft_hist <- ggplot(data = aft_avg[-c(6, 8), ],  # drop loudness and tempo (not on a 0-1 scale)
                   aes(x = reorder(aft, -rating), y = rating, fill = aft)) +
  geom_bar(stat = "identity") +
  xlab("Audio Feature") +
  ylab("Spotify Rating (0.00-1.00)")
aft_hist
Here we will examine how audio features interact with each
other.
- Correlation matrix of audio features
music_hist[-c(4)] %>%
dplyr::select(where(is.numeric)) %>%
cor() %>%
corrplot(type="lower", diag= FALSE, method = 'color')
Music has changed drastically over time, so audio feature trends may have fluctuated as well. Let's explore a decade-by-decade comparison of audio feature averages. It is important to consider whether year/decade is an influential variable that should be controlled for.
#Decades-Audio Feature Table:
options(dplyr.summarise.inform = FALSE)
dec_avg <- data_mod[-c(1,2,3,5,6)] %>%
group_by(decade,aft) %>%
dplyr::summarise(mean = mean(rating))
dec_table <- reshape2::dcast(dec_avg, decade ~ aft)
## Using mean as value column: use value.var to override.
dec_table %>%
kable %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = '700px', height = "300px")
| decade | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|
| 60s | 0.5050905 | 0.5458441 | 0.5167829 | 0.0627306 | 0.1974617 | -10.339513 | 0.0495471 | 120.3426 | 0.6745975 |
| 70s | 0.3464703 | 0.5871392 | 0.5692609 | 0.0483917 | 0.1830250 | -10.503112 | 0.0495957 | 118.4017 | 0.6640022 |
| 80s | 0.2324466 | 0.6337428 | 0.6250042 | 0.0393305 | 0.1606487 | -9.865360 | 0.0441348 | 120.5321 | 0.6389495 |
| 90s | 0.2067170 | 0.6483998 | 0.6183073 | 0.0320656 | 0.1706651 | -8.714538 | 0.0729747 | 116.8630 | 0.5607160 |
| 2000s | 0.1413434 | 0.6635125 | 0.7014338 | 0.0085183 | 0.1740826 | -5.714801 | 0.1085999 | 118.4128 | 0.5720853 |
| 2010s | 0.1601512 | 0.6676283 | 0.6622965 | 0.0224240 | 0.1666712 | -6.010614 | 0.0988537 | 122.2088 | 0.4998555 |
| 2020s | 0.2508438 | 0.6729615 | 0.5793221 | 0.1008203 | 0.1858822 | -7.174293 | 0.1181976 | 119.5157 | 0.4830072 |
acous_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'acousticness')
acous_dec <- ggplot(acous_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
dance_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'danceability')
dance_dec <- ggplot(dance_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
energy_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'energy')
energy_dec <- ggplot(energy_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
instrum_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'instrumentalness')
instrum_dec <- ggplot(instrum_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
live_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'liveness')
live_dec <- ggplot(live_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
loud_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'loudness')
loud_dec <- ggplot(loud_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
speech_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'speechiness')
speech_dec <- ggplot(speech_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
tempo_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'tempo')
tempo_dec <- ggplot(tempo_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
val_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'valence')
val_dec <- ggplot(val_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.position="none")
dummy_dec <- ggplot(val_df) +
geom_boxplot(aes(x=aft, y=rating, color = decade))+
theme(legend.direction = "vertical")
leg <- cowplot::get_legend(dummy_dec)
r1 <- plot_grid(acous_dec, dance_dec)
r2 <- plot_grid(energy_dec, instrum_dec)
r3 <- plot_grid(live_dec, speech_dec)
r4 <- plot_grid(val_dec, tempo_dec)
r5 <- plot_grid(leg, loud_dec)
r5
plot_grid(r1,r2, nrow = 2)
plot_grid(r3,r4, nrow = 2)
These boxplots show some obvious changes in the decade averages for some AFTS. Over time, songs have certainly become louder; I assume this is due to technology. Songs have also become less acoustic, probably reflecting the new wave of electronic technology and instrumentation. Valence (the happiness of a song) seems to be decreasing, which is an interesting feature to note. The other AFTS don't seem to show much change or any clear trend over time. Overall, while some AFTS on average change little over time, some certainly do. We mark music by decade: the 60s are rock 'n' roll, the 70s disco, the 80s... ambiguous, and so on. As we can see, some AFTS have distinct decade profiles. This should be considered when building the models.
Let's explore how ranking on the Billboard Year-End chart relates to the AFTS on average. It would be interesting to see whether there are any obvious trends as to what puts a song near the top 10.
bill<-data_mod %>%
dplyr::select(-c(1,2,3,4,6)) %>%
group_by(position, aft) %>%
summarise_at(vars(rating), list(rating = mean))
acous_b <- bill[bill$aft %in% "acousticness", ]
b1<- ggplot(acous_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "acousticness")
dance_b <- bill[bill$aft %in% "danceability", ]
b2 <- ggplot(dance_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "danceability")
energy_b <- bill[bill$aft %in% "energy", ]
b3 <- ggplot(energy_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm)+
labs(title = "energy")
instrum_b <- bill[bill$aft %in% "instrumentalness", ]
b4 <- ggplot(instrum_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "instrumentalness")
live_b <- bill[bill$aft %in% "liveness", ]
b5 <- ggplot(live_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "liveness")
speech_b <- bill[bill$aft %in% "speechiness", ]
b6 <-ggplot(speech_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "speechiness")
val_b <- bill[bill$aft %in% "valence", ]
b7 <- ggplot(val_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "valence")
tempo_b <- bill[bill$aft %in% "tempo",]
b8 <- ggplot(tempo_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "tempo")
loud_b <- bill[bill$aft %in% "loudness",]
b9 <- ggplot(loud_b, aes(x = position, y = rating)) +
geom_line() +
geom_smooth(method = lm) +
labs(title = "loudness")
plot_grid(b1,b2,b3,b4,b5,b6,b7,b8,b9, nrow = 3)
## `geom_smooth()` using formula 'y ~ x'
There are no strongly noticeable trends between ranking on the Billboard charts and the AFTS. Most noticeable are the negative relationships of danceability and loudness with ranking, so perhaps more popular songs tend to be more danceable and louder on average.
Next, let's take a look at some density plots for the AFTS, grouped by Grammy status. Here we can see at a glance whether the distribution of each audio feature differs by status. This may give insight into whether we can expect audio features to be good predictors for the Grammy's Song of the Year.
gram<-data_mod %>%
dplyr::select(-c(1,2,3,4,5))
acous_g <- gram[gram$aft %in% "acousticness", ]
g1 <- ggplot(acous_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title = "acousticness")
dance_g <- gram[gram$aft %in% "danceability", ]
g2 <- ggplot(dance_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title ="danceability")
energy_g <- gram[gram$aft %in% "energy", ]
g3 <- ggplot(energy_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title ="energy")
instrum_g <- gram[gram$aft %in% "instrumentalness", ]
g4 <- ggplot(instrum_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title ="instrumentalness")
live_g <- gram[gram$aft %in% "liveness", ]
g5 <- ggplot(live_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title ="liveness")
speech_g <- gram[gram$aft %in% "speechiness", ]
g6 <- ggplot(speech_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title = "speechiness")
val_g <- gram[gram$aft %in% "valence", ]
g7 <- ggplot(val_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title = "valence")
tempo_g <- gram[gram$aft %in% "tempo", ]
g8 <- ggplot(tempo_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title = "tempo")
loud_g <- gram[gram$aft %in% "loudness", ]
g9 <- ggplot(loud_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4) +
theme(legend.position="none") +
labs(title = "loudness")
gdummy <- ggplot(acous_g, aes(x=rating, group=win, fill = win)) +
geom_density(adjust = 2, alpha = 0.4)
leg2 <- cowplot::get_legend(gdummy)
x1 <- plot_grid(leg2,g9)
x2 <- plot_grid(g1,g2)
x3 <- plot_grid(g3,g4)
x4 <- plot_grid(g5,g6)
x5 <- plot_grid(g7,g8)
x1
plot_grid(x2,x3, nrow = 2)
plot_grid(x4,x5, nrow = 2)
The AFTS each have a very similar shape whether the song won, was nominated, or neither for Song of the Year. Thus it is fair to expect that our models may have a hard time predicting the categories. That said, let's check out a few classification models and see if AFTS are good predictors of Grammy Song of the Year status.
Our first set of models will try to predict Grammy Song of the Year status. We will use the following protocol:
Let's begin by splitting our data. We will stratify the training and test sets on the variable 'win'. Since there is only one Song of the Year winner per year, we want to make sure winners get distributed evenly in the split.
set.seed(1027)
grammys_split <- music_hist %>%
initial_split(prop = .8, strata = "win")
grammys_train <- training(grammys_split)
grammys_test <- testing(grammys_split)
dim(grammys_train)
## [1] 5032 15
dim(grammys_test)
## [1] 1259 15
grammys_folds <- vfold_cv(grammys_train, strata = "win", v = 10)
grammys_recipe <- recipe(win ~ year + acousticness + danceability + # predict win status using AFTS
energy + instrumentalness + liveness +
loudness + speechiness + tempo + valence,
grammys_train) %>%
step_dummy(all_nominal_predictors()) %>% # make sure all nominal variables are noted accordingly
step_interact(~energy:acousticness + danceability:acousticness + # created interactions based on the most correlated
valence:danceability + energy:loudness + # AFTS from the correlation matrix
loudness:acousticness) %>%
step_center(all_predictors()) %>% # Center and scale our variables
step_scale(all_predictors())
Using the recipe above, I will fit the following four models:
In this random forest model we will use the ranger engine, set importance to impurity, set the mode to classification, and tune mtry, trees, and min_n. Next I set up a tuning grid with ranges for the tuned hyperparameters. Finally, I saved the tuned and fitted model.
# Specs and workflow
rf_spec_g <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_wf_g <- workflow() %>%
add_model(rf_spec_g %>%
set_args(mtry = tune(),
trees = tune(),
min_n = tune())) %>%
add_recipe(grammys_recipe)
# Tuning grid
param_grid_rf_g <- grid_regular(mtry(range = c(1,10)), trees(range= c(1,5)),
min_n(range = c(3,10)), levels = 10)
tune_rf_g <- tune_grid(
rf_wf_g,
resamples = grammys_folds,
grid = param_grid_rf_g)
# save model (to avoid refitting later)
save(tune_rf_g, rf_wf_g, file = "data/model_fitting/tune_rf_g.rda")
In this nearest neighbors model we will use the kknn
engine, set the mode to classification, and
tune neighbors. Next I set up a tuning grid over the
neighbors hyperparameter. Finally, I saved the tuned and fit model.
knn_spec_g <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("classification")
knn_wf_g <- workflow() %>%
add_model(knn_spec_g %>% set_args(neighbors = tune())) %>%
add_recipe(grammys_recipe)
# Tuning grid
param_grid_knn_g <- grid_regular(neighbors(), levels = 10)
tune_knn_g <- tune_grid(
knn_wf_g,
resamples = grammys_folds,
grid = param_grid_knn_g)
# save model (to avoid refitting later)
save(tune_knn_g, knn_wf_g, file = "data/model_fitting/tune_knn_g.rda")
In this SVM we will use the kernlab engine, set the
mode to classification, and tune
cost. Next I set up a tuning grid with ranges
for the cost hyperparameter. Finally, I saved the tuned and fit
model.
# Specs and workflow
svm_spec_g <- svm_rbf() %>%
set_mode("classification") %>%
set_engine("kernlab")
svm_wf_g <- workflow() %>%
add_model(svm_spec_g %>% set_args(cost = tune())) %>%
add_recipe(grammys_recipe)
# Tuning grid
param_grid_svm_g <- grid_regular(cost(), levels = 10)
tune_svm_g <- tune_grid(
svm_wf_g,
resamples = grammys_folds,
grid = param_grid_svm_g)
# Save model (to avoid refitting later)
save(tune_svm_g, svm_wf_g, file = "data/model_fitting/tune_svm_g.rda")
In this Lasso model we will use the glmnet engine, set
mixture to 1 to indicate a Lasso
regularization, set the mode to
classification, and tune penalty. Next I set
up a tuning grid with ranges for the penalty hyperparameter. Finally, I
saved the tuned and fit model.
# Specs and Workflow
lasso_spec_g <-
multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
lasso_wf_g<- workflow() %>%
add_recipe(grammys_recipe) %>%
add_model(lasso_spec_g)
# Tuning grid
penalty_grid_lasso_g <- grid_regular(penalty(range = c(-10, 10)), levels = 10)
tune_lasso_g <- tune_grid(
lasso_wf_g,
resamples = grammys_folds,
grid = penalty_grid_lasso_g
)
# Save model (to avoid refitting later)
save(tune_lasso_g, lasso_wf_g, file = "data/model_fitting/tune_lasso_g.rda")
Since we saved our models to avoid refitting, we must load them in the following steps.
load("data/model_fitting/tune_rf_g.rda")
load("data/model_fitting/tune_knn_g.rda")
load("data/model_fitting/tune_svm_g.rda")
load("data/model_fitting/tune_lasso_g.rda")
autoplot(tune_rf_g, metric = 'roc_auc')
show_best(tune_rf_g, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 2 3 roc_auc hand_till 0.550 10 0.0151 Preprocessor1_Model0…
## 2 3 2 6 roc_auc hand_till 0.546 10 0.0266 Preprocessor1_Model1…
## 3 2 2 5 roc_auc hand_till 0.541 10 0.0180 Preprocessor1_Model1…
## 4 9 3 8 roc_auc hand_till 0.540 10 0.0193 Preprocessor1_Model2…
## 5 2 5 10 roc_auc hand_till 0.539 10 0.0214 Preprocessor1_Model3…
From show_best(), the highest mean ROC AUC
is 0.5499734, where mtry is 1,
trees is 2, and min_n is 3. An AUC of ~0.55 means the model
ranks the classes only slightly better than random guessing (0.5),
not that 55% of predictions are correct. This is not high, but
from our EDA it is fairly expected.
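ROC AUC is a ranking probability rather than a share of correct predictions: it is the chance that a randomly chosen positive case is scored above a randomly chosen negative case. A toy base-R computation (made-up scores, not model output):

```r
# Pairwise comparison form of ROC AUC: count how often positives outrank
# negatives (ties count half).
score_pos <- c(0.9, 0.7, 0.4)  # scores for three true positives
score_neg <- c(0.8, 0.3, 0.2)  # scores for three true negatives
pairs <- expand.grid(p = score_pos, n = score_neg)
auc <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc  # 7 of the 9 pairs are ranked correctly -> ~0.78
```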
autoplot(tune_knn_g, metric = 'roc_auc')
show_best(tune_knn_g, metric = "roc_auc")
## # A tibble: 5 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 roc_auc hand_till 0.501 10 0.00623 Preprocessor1_Model05
## 2 1 roc_auc hand_till 0.500 10 0.00384 Preprocessor1_Model01
## 3 6 roc_auc hand_till 0.497 10 0.00620 Preprocessor1_Model06
## 4 2 roc_auc hand_till 0.497 10 0.00415 Preprocessor1_Model02
## 5 3 roc_auc hand_till 0.496 10 0.00472 Preprocessor1_Model03
From show_best(), the highest mean ROC AUC
is 0.5008696, where neighbors is 5. An AUC
of ~0.50 means this model ranks the classes no better than chance,
performing even worse than the RF model selected above.
autoplot(tune_svm_g, metric = 'roc_auc')
show_best(tune_svm_g, metric = "roc_auc")
## # A tibble: 5 × 7
## cost .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3.17 roc_auc hand_till 0.598 10 0.0183 Preprocessor1_Model08
## 2 0.0992 roc_auc hand_till 0.591 10 0.0191 Preprocessor1_Model05
## 3 0.315 roc_auc hand_till 0.584 10 0.0162 Preprocessor1_Model06
## 4 32 roc_auc hand_till 0.584 10 0.0208 Preprocessor1_Model10
## 5 10.1 roc_auc hand_till 0.583 10 0.0154 Preprocessor1_Model09
From show_best(), the highest mean ROC AUC
is 0.5978743, where cost is 3.17. An AUC
of ~0.60 means this model separates the classes noticeably better
than chance. That is quite the improvement; so far this is the best model.
autoplot(tune_lasso_g, metric = 'roc_auc')
show_best(tune_lasso_g, metric = "roc_auc")
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0774 roc_auc hand_till 0.5 10 0 Preprocessor1_Model05
## 2 12.9 roc_auc hand_till 0.5 10 0 Preprocessor1_Model06
## 3 2154. roc_auc hand_till 0.5 10 0 Preprocessor1_Model07
## 4 359381. roc_auc hand_till 0.5 10 0 Preprocessor1_Model08
## 5 59948425. roc_auc hand_till 0.5 10 0 Preprocessor1_Model09
From show_best(), the highest mean ROC AUC
is 0.5, where penalty is 0.0774. An AUC of exactly 0.5 at
every penalty tried means the Lasso model is no better than
chance, so it is not the model to choose.
The SVM model performed the best, so we will use it in our final
workflow. Next we will create a final workflow with the best SVM model,
selecting its tuned hyperparameters using select_best().
svm_wf_tuned_g <- svm_wf_g %>%
finalize_workflow(select_best(tune_svm_g, metric = 'roc_auc'))
svm_g_fit <- fit(svm_wf_tuned_g, grammys_train)
We will now fit the finalized model to our test data, and see how it performs!
aug <- augment(svm_g_fit, new_data = grammys_test)
tbl_g <- aug %>% roc_auc(truth = win,
                         .pred_none, .pred_nominated, .pred_won)
roc_curv_g <- aug %>% roc_curve(truth = win,
                                .pred_none, .pred_nominated, .pred_won) %>% autoplot()
final_plot_g <- aug %>% conf_mat(truth = win, estimate =.pred_class) %>%
autoplot(type = "heatmap")
tbl_g
roc_curv_g
final_plot_g
The model returned an AUC of 0.4931, roughly 0.1 lower than what we got on the training set, so the SVM model did not perform well. The ROC curves and confusion matrix visualize this: the model only wants to place observations in the ‘none’ bucket. From our EDA, we saw that the density histograms of the AFTS did not differ greatly between categories, so there may not be a big enough difference in the categories’ audio features after all. In other words, there is no clear distinction in audio features between nominated and winning songs. The SVM likely defaulted to the ‘none’ category because that class had far more observations in the data set to begin with. While our model does not do a tremendous job of predicting anything, this is telling for the Grammy’s: we can conclude that the Song of the Year category has a fairly diverse taste in music, and nominations and winners may not follow a trend according to the Recording Academy.
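The imbalance effect described above can be illustrated with a toy base-R sketch (made-up class counts, not the real data): a model that always predicts the majority class already looks accurate while learning nothing about nominees.

```r
# With a heavy imbalance toward 'none', the majority-class classifier
# achieves high accuracy despite never identifying a nominee or winner.
win  <- factor(c(rep("none", 950), rep("nominated", 40), rep("won", 10)))
pred <- factor(rep("none", length(win)), levels = levels(win))
mean(pred == win)  # 0.95 accuracy, yet zero recall on 'nominated' and 'won'
```

This is why we evaluated on ROC AUC rather than accuracy in the first place.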
While the Grammy’s categories may not have been distinct enough in AFTS, perhaps the ML models will have better luck with the Billboard rankings. Our next set of models will predict position on the Year-End Billboard chart. Note that we left position as a numerical variable, so some position predictions will have decimal places. We will use the following protocol:
set.seed(1027)
billboards_split <- music_hist %>%
initial_split(prop = .8, strata = "position")
billboards_train <- training(billboards_split)
billboards_test <- testing(billboards_split)
dim(billboards_train)
## [1] 5032 15
dim(billboards_test)
## [1] 1259 15
billboards_folds <- vfold_cv(billboards_train, strata = "position", v = 10)
billboards_recipe <- recipe(position ~ year + acousticness + danceability + # predict position using AFTS
energy + instrumentalness + liveness +
loudness + speechiness + tempo + valence,
billboards_train) %>%
step_dummy(all_nominal_predictors()) %>% # make sure all nominal variables are noted accordingly
step_interact(~energy:acousticness + danceability:acousticness + # created interactions based on the most correlated
valence:danceability + energy:loudness + # AFTS from the correlation matrix
loudness:acousticness) %>%
step_center(all_predictors()) %>% # Center and scale our variables
step_scale(all_predictors())
Using the recipe above I will fit the following 4 models:
In this random forest model we will use the ranger
engine, set importance to impurity, set the
mode to regression, and tune
mtry, trees, and min_n. Next I
set up a tuning grid with ranges for the tuned hyperparameters. Finally,
I saved the tuned and fit model.
rf_spec_b <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_wf_b <- workflow() %>%
add_model(rf_spec_b %>%
set_args(mtry = tune(),
trees = tune(),
min_n = tune())) %>%
add_recipe(billboards_recipe)
param_grid_rf_b <- grid_regular(mtry(range = c(1,10)), trees(range= c(1,5)),
min_n(range = c(3,10)), levels = 10)
tune_rf_b <- tune_grid(
rf_wf_b,
resamples = billboards_folds,
grid = param_grid_rf_b)
save(tune_rf_b, rf_wf_b, file = "data/model_fitting/tune_rf_b.rda")
In this nearest neighbors model we will use the kknn
engine, set the mode to regression, and tune
neighbors. Next I set up a tuning grid with the neighbors
hyperparameter. Finally, I saved the tuned and fit model.
knn_spec_b <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("regression")
knn_wf_b <- workflow() %>%
add_model(knn_spec_b %>% set_args(neighbors = tune())) %>%
add_recipe(billboards_recipe)
param_grid_knn_b <- grid_regular(neighbors(), levels = 10)
tune_knn_b <- tune_grid(
knn_wf_b,
resamples = billboards_folds,
grid = param_grid_knn_b
)
save(tune_knn_b, knn_wf_b, file = "data/model_fitting/tune_knn_b.rda")
In this SVM we will use the kernlab engine, set the
mode to regression, and tune cost.
Next I set up a tuning grid with ranges for the cost hyperparameter.
Finally, I saved the tuned and fit model.
svm_spec_b <- svm_rbf() %>%
set_mode("regression") %>%
set_engine("kernlab")
svm_wf_b <- workflow() %>%
add_model(svm_spec_b %>% set_args(cost = tune())) %>%
add_recipe(billboards_recipe)
param_grid_svm_b <- grid_regular(cost(), levels = 10)
tune_svm_b <- tune_grid(
svm_wf_b,
resamples = billboards_folds,
grid = param_grid_svm_b
)
save(tune_svm_b, svm_wf_b, file = "data/model_fitting/tune_svm_b.rda")
In this Lasso model we will use the glmnet engine, set
mixture to 1 to indicate a Lasso regularization,
set the mode to regression, and tune penalty.
Next I set up a tuning grid with ranges for the penalty
hyperparameter. Finally, I saved the tuned and fit model.
lasso_spec_b <-
linear_reg(penalty = tune(), mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
lasso_wf_b <- workflow() %>%
add_recipe(billboards_recipe) %>%
add_model(lasso_spec_b)
penalty_grid_lasso_b <- grid_regular(penalty(), levels = 10)
tune_lasso_b <- tune_grid(
lasso_wf_b,
resamples = billboards_folds, # use the Billboard folds here, not the Grammy folds
grid = penalty_grid_lasso_b
)
save(tune_lasso_b, lasso_wf_b, file = "data/model_fitting/tune_lasso_b.rda")
Since we saved our models to avoid refitting, we must load them in the following steps.
load("data/model_fitting/tune_rf_b.rda")
load("data/model_fitting/tune_knn_b.rda")
load("data/model_fitting/tune_svm_b.rda")
load("data/model_fitting/tune_lasso_b.rda")
autoplot(tune_rf_b, metric = 'rmse')
show_best(tune_rf_b, metric = "rmse")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 5 10 rmse standard 29.4 10 0.0725 Preprocessor1_Model3…
## 2 1 4 8 rmse standard 29.4 10 0.0714 Preprocessor1_Model2…
## 3 1 5 7 rmse standard 29.4 10 0.0804 Preprocessor1_Model2…
## 4 1 4 7 rmse standard 29.4 10 0.0596 Preprocessor1_Model2…
## 5 1 2 4 rmse standard 29.4 10 0.0840 Preprocessor1_Model0…
From show_best(), the lowest mean RMSE
is 29.36425 where mtry is 1,
trees is 5, and min_n is 10.
This is a pretty high value, which may indicate that AFTS may not be
great predictors for the Billboard’s data either. However, let’s look at
the other models.
autoplot(tune_knn_b, metric = 'rmse')
show_best(tune_knn_b, metric = "rmse")
## # A tibble: 5 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 rmse standard 32.1 10 0.148 Preprocessor1_Model10
## 2 9 rmse standard 32.4 10 0.157 Preprocessor1_Model09
## 3 8 rmse standard 32.7 10 0.169 Preprocessor1_Model08
## 4 7 rmse standard 33.1 10 0.183 Preprocessor1_Model07
## 5 6 rmse standard 33.6 10 0.196 Preprocessor1_Model06
From show_best(), the lowest mean RMSE
is 32.11814 where neighbors is 10. The
RMSE is higher, so this nearest neighbors model performed worse.
autoplot(tune_svm_b, metric = 'rmse')
show_best(tune_svm_b, metric = "rmse")
## # A tibble: 5 × 7
## cost .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000977 rmse standard 29.3 10 0.0644 Preprocessor1_Model01
## 2 0.00310 rmse standard 29.4 10 0.0643 Preprocessor1_Model02
## 3 0.00984 rmse standard 29.4 10 0.0644 Preprocessor1_Model03
## 4 0.0312 rmse standard 29.5 10 0.0674 Preprocessor1_Model04
## 5 0.0992 rmse standard 29.6 10 0.0784 Preprocessor1_Model05
From show_best(), the lowest mean RMSE
is 29.34608, where cost is 0.0009765625.
This is the lowest RMSE thus far, making the SVM our best model yet.
autoplot(tune_lasso_b, metric = 'rmse')
show_best(tune_lasso_b, metric = "rmse")
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 29.3 10 0.169 Preprocessor1_Model10
## 2 0.0774 rmse standard 29.6 10 0.161 Preprocessor1_Model09
## 3 0.00599 rmse standard 29.7 10 0.147 Preprocessor1_Model08
## 4 0.000464 rmse standard 29.7 10 0.146 Preprocessor1_Model07
## 5 0.0000000001 rmse standard 29.7 10 0.145 Preprocessor1_Model01
From show_best(), the lowest mean RMSE
is 29.34967 where penalty is 1.0. The
RMSE is just a bit higher than our SVM model.
Thus we will continue with the SVM model in our finalized workflow.
Next we will create the final workflow with the best SVM
hyperparameters using select_best().
svm_wf_tuned_b <- svm_wf_b %>%
finalize_workflow(select_best(tune_svm_b, metric = 'rmse'))
svm_b_fit <- fit(svm_wf_tuned_b, billboards_train)
billboards_metric <- metric_set(rmse, rsq)
predictions <- predict(svm_b_fit, new_data = billboards_test)
position <- billboards_test['position']
pos_dec <- billboards_test[c('position','decade')]
test <- billboards_test[c('songname','artistname', 'year','decade', 'position')]
model_test_predictions <- cbind(position, predictions)
mod_test_pred_info <- cbind(test, predictions)
# RMSE and R-squared on the test set
model_test_predictions %>%
billboards_metric(truth = position, estimate = .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 29.3
## 2 rsq standard 0.00440
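For context on the test RMSE of ~29.3, compare it to a naive baseline that always predicts the mean position. Assuming positions spread roughly uniformly over 1 to 100 (a toy computation, not the actual chart data), that baseline’s RMSE is simply the population standard deviation:

```r
# Baseline RMSE of a constant mean prediction over ranks 1..100:
# sqrt of the population variance, which is (n^2 - 1) / 12 for 1..n.
position <- 1:100
baseline_rmse <- sqrt(mean((position - mean(position))^2))
round(baseline_rmse, 2)  # 28.87
```

The model’s RMSE barely improves on this constant-prediction baseline, which matches the near-zero R-squared above.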
# adding a difference column
mod_test_pred_info['difference'] = (model_test_predictions$position
- model_test_predictions$.pred)
tbl2a <- mod_test_pred_info %>% mutate(difference = abs(difference))
tbl2b <- tbl2a %>%
filter(difference <= 10)
tbl2b %>%
kable %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = '700px', height = "300px")
| songname | artistname | year | decade | position | .pred | difference |
|---|---|---|---|---|---|---|
| Theme from The Apartment | Ferrante & Teicher | 1960 | 60s | 53 | 51.38268 | 1.6173248 |
| The Village of St. Bernadette | Andy Williams | 1960 | 60s | 56 | 51.41583 | 4.5841744 |
| I Love How You Love Me | The Paris Sisters | 1961 | 60s | 52 | 51.54716 | 0.4528377 |
| The Way You Look Tonight | The Lettermen | 1961 | 60s | 58 | 51.48512 | 6.5148807 |
| Crying in the Rain | The Everly Brothers | 1962 | 60s | 47 | 50.90827 | 3.9082734 |
| Shout! Shout! (Knock Yourself Out) | Ernie Maresca | 1962 | 60s | 49 | 50.85376 | 1.8537558 |
| Smoky Places | The Corsairs | 1962 | 60s | 51 | 51.53550 | 0.5354984 |
| Green Onions | Booker T. & the M.G.’s | 1962 | 60s | 53 | 50.53243 | 2.4675685 |
| Mean Woman Blues | Roy Orbison | 1963 | 60s | 53 | 51.68449 | 1.3155117 |
| Memphis | Johnny Rivers | 1964 | 60s | 41 | 50.77643 | 9.7764317 |
| White on White | Danny Williams | 1964 | 60s | 42 | 50.68295 | 8.6829499 |
| Um, Um, Um, Um, Um, Um | Major Lance | 1964 | 60s | 47 | 50.50596 | 3.5059583 |
| I’m Telling You Now | Freddie and the Dreamers | 1965 | 60s | 42 | 50.87109 | 8.8710941 |
| The Seventh Son | Johnny Rivers | 1965 | 60s | 45 | 51.66070 | 6.6607016 |
| Down in the Boondocks | Billy Joe Royal | 1965 | 60s | 52 | 50.89762 | 1.1023760 |
| Hooray for Hazel | Tommy Roe | 1966 | 60s | 44 | 51.01911 | 7.0191095 |
| We Can Work It Out | The Beatles | 1966 | 60s | 49 | 51.94941 | 2.9494080 |
| Mercy, Mercy, Mercy | The Buckinghams | 1967 | 60s | 51 | 50.96612 | 0.0338780 |
| (Your Love Keeps Lifting Me) Higher and Higher | Jackie Wilson | 1967 | 60s | 53 | 50.81787 | 2.1821331 |
| Georgy Girl | The Seekers | 1967 | 60s | 57 | 51.53067 | 5.4693283 |
| California Nights | Lesley Gore | 1967 | 60s | 61 | 51.86240 | 9.1376004 |
| Classical Gas | Mason Williams | 1968 | 60s | 43 | 51.12956 | 8.1295621 |
| (Sweet Sweet Baby) Since You’ve Been Gone | Aretha Franklin | 1968 | 60s | 46 | 50.46195 | 4.4619486 |
| Reach Out of the Darkness | Friend & Lover | 1968 | 60s | 49 | 51.25209 | 2.2520852 |
| Jumpin’ Jack Flash | The Rolling Stones | 1968 | 60s | 50 | 51.61875 | 1.6187460 |
| MacArthur Park | Richard Harris | 1968 | 60s | 51 | 51.71059 | 0.7105897 |
| Take Time to Know Her | Percy Sledge | 1968 | 60s | 54 | 51.50976 | 2.4902437 |
| I Got the Feelin’ | James Brown | 1968 | 60s | 58 | 50.47618 | 7.5238154 |
| I’ve Gotta Be Me | Sammy Davis Jr. | 1969 | 60s | 51 | 51.14837 | 0.1483681 |
| Runaway Child, Running Wild | The Temptations | 1969 | 60s | 57 | 51.51656 | 5.4834351 |
| Galveston | Glen Campbell | 1969 | 60s | 59 | 51.33192 | 7.6680816 |
| Reflections of My Life | Marmalade | 1970 | 70s | 43 | 51.68454 | 8.6845421 |
| Hey There Lonely Girl | Eddie Holman | 1970 | 70s | 44 | 51.28802 | 7.2880174 |
| He Ain’t Heavy, He’s My Brother | The Hollies | 1970 | 70s | 46 | 51.54880 | 5.5488023 |
| Come and Get It | Badfinger | 1970 | 70s | 48 | 50.96576 | 2.9657611 |
| Turn Back the Hands of Time | Tyrone Davis | 1970 | 70s | 51 | 50.99237 | 0.0076285 |
| In the Summertime | Mungo Jerry | 1970 | 70s | 53 | 50.58570 | 2.4142979 |
| Bridge over Troubled Water | Aretha Franklin | 1971 | 70s | 52 | 51.06465 | 0.9353502 |
| Draggin’ the Line | Tommy James | 1971 | 70s | 54 | 51.01941 | 2.9805893 |
| Stay Awhile | The Bells | 1971 | 70s | 57 | 51.05334 | 5.9466606 |
| Sweet City Woman | The Stampeders | 1971 | 70s | 58 | 50.70168 | 7.2983220 |
| If | Bread | 1971 | 70s | 61 | 51.23475 | 9.7652470 |
| Morning Has Broken | Cat Stevens | 1972 | 70s | 44 | 51.33553 | 7.3355334 |
| I Can See Clearly Now | Johnny Nash | 1972 | 70s | 47 | 50.79851 | 3.7985071 |
| Jungle Fever | The Chakachas | 1972 | 70s | 51 | 51.60232 | 0.6023159 |
| Where Is the Love | Roberta Flack & Donny Hathaway | 1972 | 70s | 58 | 51.68236 | 6.3176354 |
| Give Me Love (Give Me Peace on Earth) | George Harrison | 1973 | 70s | 42 | 51.69420 | 9.6942045 |
| Feelin’ Stronger Every Day | Chicago | 1973 | 70s | 54 | 52.04427 | 1.9557290 |
| I Believe in You (You Believe in Me) | Johnnie Taylor | 1973 | 70s | 58 | 50.84781 | 7.1521935 |
| Mockingbird | Carly Simon & James Taylor | 1974 | 70s | 52 | 51.49158 | 0.5084160 |
| Never, Never Gonna Give You Up | Barry White | 1974 | 70s | 55 | 51.41254 | 3.5874571 |
| Chevy Van | Sammy Johns | 1975 | 70s | 48 | 51.82902 | 3.8290166 |
| Shannon | Henry Gross | 1976 | 70s | 47 | 51.35771 | 4.3577075 |
| Devil Woman | Cliff Richard | 1976 | 70s | 55 | 50.54941 | 4.4505875 |
| Lucille | Kenny Rogers | 1977 | 70s | 43 | 51.05996 | 8.0599637 |
| Handy Man | James Taylor | 1977 | 70s | 46 | 50.80201 | 4.8020135 |
| I Wish | Stevie Wonder | 1977 | 70s | 51 | 51.04956 | 0.0495606 |
| After the Lovin’ | Engelbert Humperdinck | 1977 | 70s | 61 | 51.70819 | 9.2918065 |
| Love Will Find a Way | Pablo Cruise | 1978 | 70s | 44 | 51.36557 | 7.3655687 |
| Love Is in the Air | John Paul Young | 1978 | 70s | 46 | 51.43842 | 5.4384207 |
| Thunder Island | Jay Ferguson | 1978 | 70s | 50 | 51.64689 | 1.6468871 |
| Here You Come Again | Dolly Parton | 1978 | 70s | 60 | 51.09256 | 8.9074401 |
| She Believes in Me | Kenny Rogers | 1979 | 70s | 47 | 51.83634 | 4.8363442 |
| In the Navy | Village People | 1979 | 70s | 48 | 51.27700 | 3.2770046 |
| The Devil Went Down to Georgia | The Charlie Daniels Band | 1979 | 70s | 50 | 51.41426 | 1.4142591 |
| We Are Family | Sister Sledge | 1979 | 70s | 53 | 50.38720 | 2.6127982 |
| Boogie Wonderland | Earth, Wind & Fire | 1979 | 70s | 57 | 51.02389 | 5.9761146 |
| Stomp! | The Brothers Johnson | 1980 | 80s | 46 | 50.70625 | 4.7062473 |
| Emotional Rescue | The Rolling Stones | 1980 | 80s | 53 | 50.67600 | 2.3239994 |
| You’re Only Lonely | J.D. Souther | 1980 | 80s | 57 | 51.44420 | 5.5558018 |
| How ’Bout Us | Champaign | 1981 | 80s | 45 | 51.34459 | 6.3445891 |
| America | Neil Diamond | 1981 | 80s | 62 | 52.01692 | 9.9830784 |
| Do You Believe in Love | Huey Lewis and the News | 1982 | 80s | 51 | 51.63732 | 0.6373187 |
| Wasted on the Way | Crosby, Stills & Nash | 1982 | 80s | 57 | 51.73079 | 5.2692146 |
| One Hundred Ways | Quincy Jones | 1982 | 80s | 61 | 51.10581 | 9.8941907 |
| 1999 | Prince | 1983 | 80s | 41 | 50.41881 | 9.4188082 |
| Tell Her About It | Billy Joel | 1983 | 80s | 45 | 50.91198 | 5.9119822 |
| Too Shy | Kajagoogoo | 1983 | 80s | 50 | 50.92466 | 0.9246598 |
| Don’t Let It End | Styx | 1983 | 80s | 60 | 51.61601 | 8.3839912 |
| Twist of Fate | Olivia Newton-John | 1984 | 80s | 42 | 51.38148 | 9.3814798 |
| Let the Music Play | Shannon | 1984 | 80s | 49 | 51.12616 | 2.1261579 |
| Almost Paradise | Mike Reno and Ann Wilson | 1984 | 80s | 59 | 51.01184 | 7.9881551 |
| Legs | ZZ Top | 1984 | 80s | 60 | 50.84951 | 9.1504861 |
| Freeway of Love | Aretha Franklin | 1985 | 80s | 43 | 50.78185 | 7.7818460 |
| You Give Good Love | Whitney Houston | 1985 | 80s | 47 | 50.96137 | 3.9613722 |
| Raspberry Beret | Prince & the Revolution | 1985 | 80s | 51 | 51.07033 | 0.0703265 |
| The Boys of Summer | Don Henley | 1985 | 80s | 53 | 51.28324 | 1.7167620 |
| If You Love Somebody Set Them Free | Sting | 1985 | 80s | 55 | 51.41054 | 3.5894570 |
| We Don’t Need Another Hero (Thunderdome) | Tina Turner | 1985 | 80s | 57 | 50.50589 | 6.4941054 |
| Material Girl | Madonna | 1985 | 80s | 58 | 50.22680 | 7.7731987 |
| Axel F | Harold Faltermeyer | 1985 | 80s | 61 | 51.69450 | 9.3055015 |
| Danger Zone | Kenny Loggins | 1986 | 80s | 42 | 51.81582 | 9.8158170 |
| If You Leave | Orchestral Manoeuvres in the Dark | 1986 | 80s | 53 | 51.70089 | 1.2991066 |
| Invisible Touch | Genesis | 1986 | 80s | 54 | 51.44331 | 2.5566886 |
| All Cried Out | Lisa Lisa and Cult Jam | 1986 | 80s | 61 | 52.02600 | 8.9739961 |
| In Too Deep | Genesis | 1987 | 80s | 47 | 51.09929 | 4.0992932 |
| Let’s Wait Awhile | Janet Jackson | 1987 | 80s | 48 | 51.17102 | 3.1710246 |
| Little Lies | Fleetwood Mac | 1987 | 80s | 51 | 51.18694 | 0.1869414 |
| Carrie | Europe | 1987 | 80s | 56 | 51.71122 | 4.2887841 |
| Together Forever | Rick Astley | 1988 | 80s | 44 | 51.60969 | 7.6096915 |
| Out of the Blue | Debbie Gibson | 1988 | 80s | 54 | 51.34833 | 2.6516727 |
| Don’t You Want Me | Jody Watley | 1988 | 80s | 55 | 51.01066 | 3.9893418 |
| I Get Weak | Belinda Carlisle | 1988 | 80s | 57 | 51.79300 | 5.2069958 |
| Girlfriend | Pebbles | 1988 | 80s | 60 | 50.23531 | 9.7646889 |
| Dirty Diana | Michael Jackson | 1988 | 80s | 61 | 51.14492 | 9.8550782 |
| Bust a Move | Young MC | 1989 | 80s | 42 | 50.60671 | 8.6067140 |
| So Alive | Love and Rockets | 1989 | 80s | 51 | 51.53978 | 0.5397772 |
| Here and Now | Luther Vandross | 1990 | 90s | 43 | 51.07842 | 8.0784194 |
| No More Lies | Michel’le | 1990 | 90s | 50 | 50.26567 | 0.2656734 |
| Do You Remember? | Phil Collins | 1990 | 90s | 53 | 50.62888 | 2.3711174 |
| Black Cat | Janet Jackson | 1990 | 90s | 59 | 50.83133 | 8.1686742 |
| Love of a Lifetime | FireHouse | 1991 | 90s | 43 | 51.66302 | 8.6630199 |
| Love Is a Wonderful Thing | Michael Bolton | 1991 | 90s | 49 | 50.87552 | 1.8755210 |
| Tom’s Diner | DNA | 1991 | 90s | 53 | 50.58501 | 2.4149931 |
| Make It Happen | Mariah Carey | 1992 | 90s | 42 | 50.48891 | 8.4889072 |
| Breakin’ My Heart (Pretty Brown Eyes) | Mint Condition | 1992 | 90s | 48 | 51.19091 | 3.1909115 |
| Mysterious Ways | U2 | 1992 | 90s | 57 | 50.92015 | 6.0798531 |
| Hip Hop Hooray | Naughty by Nature | 1993 | 90s | 45 | 51.14501 | 6.1450067 |
| Will You Be There | Michael Jackson | 1993 | 90s | 47 | 51.22222 | 4.2222166 |
| All That She Wants | Ace of Base | 1993 | 90s | 51 | 50.80370 | 0.1963001 |
| 7 | Prince and The New Power Generation | 1993 | 90s | 52 | 50.68570 | 1.3143021 |
| Here We Go Again! | Portrait | 1993 | 90s | 59 | 51.28721 | 7.7127875 |
| Loser | Beck | 1994 | 90s | 50 | 50.84596 | 0.8459649 |
| Never Lie | Immature | 1994 | 90s | 53 | 50.70103 | 2.2989700 |
| Understanding | Xscape | 1994 | 90s | 58 | 51.05689 | 6.9431103 |
| I’ll Be There for You | Method Man | 1995 | 90s | 42 | 50.61011 | 8.6101091 |
| Before I Let You Go | Blackstreet | 1995 | 90s | 46 | 51.00220 | 5.0021953 |
| I Wanna Be Down | Brandy | 1995 | 90s | 49 | 50.66276 | 1.6627553 |
| Hold On | Jamie Walters | 1995 | 90s | 52 | 51.53297 | 0.4670262 |
| Diggin’ on You | TLC | 1996 | 90s | 45 | 50.97852 | 5.9785199 |
| Elevators (Me & You) | Outkast | 1996 | 90s | 59 | 51.02793 | 7.9720719 |
| Mouth | Merril Bainbridge | 1997 | 90s | 42 | 50.56495 | 8.5649482 |
| Invisible Man | 98 Degrees | 1997 | 90s | 45 | 51.26420 | 6.2641997 |
| Get It Together | 702 | 1997 | 90s | 48 | 50.74516 | 2.7451593 |
| It’s All Coming Back to Me Now | Céline Dion | 1997 | 90s | 50 | 51.21927 | 1.2192731 |
| Never Make a Promise | Dru Hill | 1997 | 90s | 56 | 50.94679 | 5.0532125 |
| Everyday Is a Winding Road | Sheryl Crow | 1997 | 90s | 60 | 51.12416 | 8.8758406 |
| I Will Come to You | Hanson | 1998 | 90s | 50 | 51.69779 | 1.6977917 |
| Swing My Way | K. P. & Envyi | 1998 | 90s | 52 | 50.34689 | 1.6531100 |
| The Arms of the One Who Loves You | Xscape | 1998 | 90s | 53 | 51.23550 | 1.7644993 |
| My Love Is the Shhh! | Somethin’ for the People | 1998 | 90s | 54 | 50.41280 | 3.5872000 |
| Lookin’ at Me | Mase | 1998 | 90s | 59 | 50.98053 | 8.0194709 |
| Looking Through Your Eyes | LeAnn Rimes | 1998 | 90s | 60 | 51.09702 | 8.9029782 |
| From This Moment On | Shania Twain | 1999 | 90s | 57 | 51.74147 | 5.2585259 |
| Never Let You Go | Third Eye Blind | 2000 | 200s | 43 | 50.68836 | 7.6883573 |
| My Love Is Your Love | Whitney Houston | 2000 | 200s | 47 | 51.13782 | 4.1378191 |
| Wifey | Next | 2000 | 200s | 53 | 50.62591 | 2.3740912 |
| Purest of Pain (A Puro Dolor) | Son by Four | 2000 | 200s | 61 | 51.02738 | 9.9726154 |
| This I Promise You | ’N Sync | 2001 | 200s | 51 | 51.61780 | 0.6177985 |
| Only Time | Enya | 2001 | 200s | 59 | 51.48496 | 7.5150352 |
| I Do!! | Toya | 2001 | 200s | 60 | 50.57182 | 9.4281846 |
| Down 4 U | Ja Rule | 2002 | 200s | 44 | 51.10625 | 7.1062486 |
| Can’t Get You Out of My Head | Kylie Minogue | 2002 | 200s | 45 | 50.58152 | 5.5815183 |
| More Than A Woman | Aaliyah | 2002 | 200s | 58 | 50.60543 | 7.3945716 |
| I Can | Nas | 2003 | 200s | 58 | 50.81895 | 7.1810525 |
| Splash Waterfalls | Ludacris | 2004 | 200s | 42 | 51.05559 | 9.0555932 |
| With You | Jessica Simpson | 2004 | 200s | 50 | 51.11747 | 1.1174680 |
| My Happy Ending | Avril Lavigne | 2004 | 200s | 54 | 51.54583 | 2.4541705 |
| Roses | OutKast | 2004 | 200s | 56 | 50.72983 | 5.2701737 |
| Photograph | Nickelback | 2005 | 200s | 43 | 51.48917 | 8.4891656 |
| Collide | Howie Day | 2005 | 200s | 45 | 51.29150 | 6.2915044 |
| Slow Down | Bobby Valentino | 2005 | 200s | 47 | 51.29531 | 4.2953120 |
| My Boo | Usher and Alicia Keys | 2005 | 200s | 54 | 51.29793 | 2.7020699 |
| Yo (Excuse Me Miss) | Chris Brown | 2006 | 200s | 44 | 51.11746 | 7.1174554 |
| Walk Away | Kelly Clarkson | 2006 | 200s | 45 | 50.25134 | 5.2513367 |
| Sexy Love | Ne-Yo | 2006 | 200s | 51 | 50.74380 | 0.2562013 |
| So What | Field Mob | 2006 | 200s | 55 | 50.22706 | 4.7729391 |
| Make Me Better | Fabolous | 2007 | 200s | 44 | 51.30994 | 7.3099359 |
| Waiting on the World to Change | John Mayer | 2007 | 200s | 47 | 51.41487 | 4.4148652 |
| Because of You | Ne-Yo | 2007 | 200s | 57 | 50.26307 | 6.7369288 |
| I Tried | Bone Thugs-n-Harmony | 2007 | 200s | 58 | 51.09258 | 6.9074216 |
| Shawty | Plies | 2007 | 200s | 60 | 51.78543 | 8.2145658 |
| Our Song | Taylor Swift | 2008 | 200s | 41 | 50.81990 | 9.8199019 |
| Damaged | Danity Kane | 2008 | 200s | 42 | 50.33752 | 8.3375157 |
| Sorry | Buckcherry | 2008 | 200s | 44 | 51.46058 | 7.4605799 |
| Independent | Webbie | 2008 | 200s | 45 | 51.35510 | 6.3551001 |
| Can’t Believe It | T-Pain | 2008 | 200s | 46 | 51.46675 | 5.4667518 |
| Like You’ll Never See Me Again | Alicia Keys | 2008 | 200s | 47 | 51.69597 | 4.6959693 |
| What You Got | Colby O’Donis | 2008 | 200s | 51 | 51.02000 | 0.0199960 |
| Sweetest Girl (Dollar Bill) | Wyclef Jean | 2008 | 200s | 56 | 50.64075 | 5.3592479 |
| Miss Independent | Ne-Yo | 2008 | 200s | 57 | 50.69724 | 6.3027559 |
| Obsessed | Mariah Carey | 2009 | 200s | 41 | 50.93942 | 9.9394194 |
| Love Lockdown | Kanye West | 2009 | 200s | 44 | 51.31590 | 7.3158955 |
| Gotta Be Somebody | Nickelback | 2009 | 200s | 51 | 51.55736 | 0.5573562 |
| Beautiful | Akon | 2009 | 200s | 54 | 50.75364 | 3.2463580 |
| Bulletproof | La Roux | 2010 | 200s | 42 | 50.62884 | 8.6288413 |
| Hard | Rihanna | 2010 | 200s | 49 | 51.94305 | 2.9430544 |
| Young Forever | Jay-Z | 2010 | 200s | 50 | 51.30314 | 1.3031445 |
| According to You | Orianthi | 2010 | 200s | 55 | 50.95062 | 4.0493810 |
| Over | Drake | 2010 | 200s | 60 | 50.85600 | 9.1440038 |
| Animal | Neon Trees | 2010 | 200s | 61 | 51.44956 | 9.5504411 |
| Dynamite | Taio Cruz | 2011 | 2010s | 44 | 50.41759 | 6.4175857 |
| Moment 4 Life | Nicki Minaj | 2011 | 2010s | 50 | 51.55146 | 1.5514642 |
| Just a Dream | Nelly | 2011 | 2010s | 52 | 51.49288 | 0.5071200 |
| Motivation | Kelly Rowland | 2011 | 2010s | 53 | 51.09167 | 1.9083346 |
| Jar of Hearts | Christina Perri | 2011 | 2010s | 55 | 51.22755 | 3.7724481 |
| Hold It Against Me | Britney Spears | 2011 | 2010s | 60 | 51.10170 | 8.8983011 |
| Feel So Close | Calvin Harris | 2012 | 2010s | 42 | 50.56060 | 8.5605972 |
| Domino | Jessie J | 2012 | 2010s | 46 | 50.39532 | 4.3953195 |
| Home | Phillip Phillips | 2012 | 2010s | 49 | 51.42802 | 2.4280172 |
| Not Over You | Gavin DeGraw | 2012 | 2010s | 60 | 51.50651 | 8.4934947 |
| Summertime Sadness | Lana Del Rey and Cédric Gervais | 2013 | 2010s | 45 | 51.55113 | 6.5511293 |
| I Need Your Love | Calvin Harris | 2013 | 2010s | 56 | 50.86921 | 5.1307858 |
| Some Nights | Fun | 2013 | 2010s | 58 | 51.14364 | 6.8563565 |
| Classic | MKTO | 2014 | 2010s | 50 | 50.63430 | 0.6342982 |
| My Hitta | YG | 2014 | 2010s | 58 | 50.56621 | 7.4337922 |
| Hey Brother | Avicii | 2014 | 2010s | 60 | 51.36996 | 8.6300429 |
| Somebody | Natalie La Rose | 2015 | 2010s | 41 | 50.30690 | 9.3069000 |
| Nasty Freestyle | T-Wayne | 2015 | 2010s | 50 | 50.84530 | 0.8452978 |
| I Don’t Mind | Usher | 2015 | 2010s | 55 | 50.58818 | 4.4118223 |
| Wildest Dreams | Taylor Swift | 2015 | 2010s | 57 | 51.22130 | 5.7787006 |
| You Know You Like It | DJ Snake and AlunaGeorge | 2015 | 2010s | 59 | 51.20170 | 7.7982986 |
| Uma Thurman | Fall Out Boy | 2015 | 2010s | 60 | 50.81071 | 9.1892943 |
| Like I’m Gonna Lose You | Meghan Trainor | 2016 | 2010s | 42 | 51.22235 | 9.2223513 |
| Let Me Love You | DJ Snake | 2016 | 2010s | 47 | 51.35401 | 4.3540086 |
| We Don’t Talk Anymore | Charlie Puth | 2016 | 2010s | 50 | 50.84914 | 0.8491360 |
| Hands to Myself | Selena Gomez | 2016 | 2010s | 56 | 50.63782 | 5.3621810 |
| 2 Phones | Kevin Gates | 2016 | 2010s | 57 | 50.74652 | 6.2534827 |
| In the Night | The Weeknd | 2016 | 2010s | 61 | 51.54271 | 9.4572922 |
| Rockabye | Clean Bandit | 2017 | 2010s | 44 | 50.47385 | 6.4738514 |
| Feel It Still | Portugal. The Man | 2017 | 2010s | 45 | 50.43923 | 5.4392301 |
| Bank Account | 21 Savage | 2017 | 2010s | 48 | 50.86221 | 2.8622120 |
| Heathens | Twenty One Pilots | 2017 | 2010s | 58 | 50.86191 | 7.1380935 |
| Sicko Mode | Travis Scott | 2018 | 2010s | 42 | 51.00392 | 9.0039197 |
| Gucci Gang | Lil Pump | 2018 | 2010s | 44 | 50.18350 | 6.1835008 |
| Too Good at Goodbyes | Sam Smith | 2018 | 2010s | 49 | 50.80896 | 1.8089618 |
| Bodak Yellow | Cardi B | 2018 | 2010s | 54 | 50.45163 | 3.5483720 |
| Wolves | Selena Gomez and Marshmello | 2018 | 2010s | 60 | 51.17621 | 8.8237948 |
| Bartier Cardi | Cardi B | 2018 | 2010s | 61 | 51.15730 | 9.8426979 |
| Look Back at It | A Boogie wit da Hoodie | 2019 | 2010s | 41 | 50.67069 | 9.6706910 |
| A Lot | 21 Savage | 2019 | 2010s | 42 | 51.17316 | 9.1731560 |
| Mia | Bad Bunny | 2019 | 2010s | 44 | 51.16021 | 7.1602115 |
| Beautiful Crazy | Luke Combs | 2019 | 2010s | 46 | 51.14334 | 5.1433434 |
| Ritmo (Bad Boys for Life) | Black Eyed Peas and J Balvin | 2020 | 2020s | 50 | 50.46130 | 0.4612958 |
| Nobody but You | Blake Shelton and Gwen Stefani | 2020 | 2020s | 52 | 52.01898 | 0.0189764 |
| Truth Hurts | Lizzo | 2020 | 2020s | 55 | 50.85850 | 4.1415024 |
| Yummy | Justin Bieber | 2020 | 2020s | 58 | 50.78857 | 7.2114323 |
| Got What I Got | Jason Aldean | 2020 | 2020s | 60 | 51.03851 | 8.9614851 |
| Dynamite | BTS | 2021 | 2020s | 41 | 50.60775 | 9.6077492 |
| Beat Box | SpotemGottem | 2021 | 2020s | 44 | 51.16275 | 7.1627465 |
| Telepatía | Kali Uchis | 2021 | 2020s | 49 | 51.21703 | 2.2170348 |
| Bang! | AJR | 2021 | 2020s | 56 | 50.63090 | 5.3690985 |
| Essence | Wizkid | 2021 | 2020s | 60 | 50.56096 | 9.4390447 |
| Chasing After You | Ryan Hurd and Maren Morris | 2021 | 2020s | 61 | 51.17999 | 9.8200114 |
In the first table here we can see that the RMSE on the test data was
29.33202. This is even slightly better than how the best SVM model
performed on our training set, and still very close. We did not overfit!
Thus, our model performed fairly well on the test data relative to the
training data. Overall, however, the final model has a fairly high RMSE
and a very low \(R^2\) value of 0.0044. Like the Grammy’s data, our
model is not very predictive of the Billboard ranking: it fits most
predictions within positions 45 to 60. However, looking at the second
table, we can see some songs where the difference between the predicted
and actual value was less than or equal to 10. Only 244 of the 5027 test
data points meet this criterion. Perhaps a certain decade has better
predictions, which we can check in the following plot.
# using a similar plot from the example final project!
tbl2a %>%
  filter(difference >= 29.33) %>%
  ggplot(aes(x = position, y = .pred)) +
  geom_abline(lty = 2) +
  geom_point(alpha = 1) +
  facet_wrap(~decade) +
  labs(
    title = "Test Data Set Predictions vs. Actual",
    subtitle = "Greater Than 29.33 Difference
(Note: the 2020s panel has less data overall due to the decade's sample size)",
    y = "Predicted Position on Billboard's Year-End Top 100",
    x = "Position on Billboard's Year-End Top 100"
  )
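As a numeric complement to the plot, the same decade-by-decade comparison can be made in table form. Below is a minimal sketch, assuming the results data frame (called `tbl2a` here, as in the plot above) holds columns `position` (actual), `.pred` (predicted), and `decade`; it is illustrated with a small made-up data frame rather than the real test set:

```r
library(dplyr)
library(yardstick)

# Toy stand-in for tbl2a: actual positions, predictions, and decades
tbl2a <- tibble(
  position = c(5, 40, 72, 18, 90, 55),
  .pred    = c(48, 52, 58, 50, 47, 54),
  decade   = c("1960s", "1960s", "1970s", "1970s", "1980s", "1980s")
)

# Overall test-set RMSE and R-squared
rmse(tbl2a, truth = position, estimate = .pred)
rsq(tbl2a, truth = position, estimate = .pred)

# How many predictions land within 10 positions of the truth?
tbl2a %>%
  mutate(difference = abs(position - .pred)) %>%
  summarise(within_10 = sum(difference <= 10), n = n())

# RMSE broken out by decade, to check for an outlying time period
tbl2a %>%
  group_by(decade) %>%
  rmse(truth = position, estimate = .pred)
```

Because yardstick metric functions respect `group_by()`, the per-decade breakdown needs no extra reshaping.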
Since the 2020s panel contains negligible data, we compare the other six
decades. No single decade stands out in terms of points whose difference
exceeds the RMSE of 29.33, so there are no outlying time periods. In
sum, we can conclude that the Billboard data is also hard to model with
AFTS. Perhaps popular music is unpredictable; what goes viral can be
unexpected. I will address further observations and reasons in the
conclusion, but for now I will leave it at this: AFTS alone may not be
good indicators of music popularity.
I wanted to explore, essentially, whether there is a ‘recipe’ for
popularity. Are there certain audio features that make a song more
likable? The intention of this project was to see if we could determine
song popularity using Spotify Audio Features. Popularity was determined
by two sources: The Recording Academy’s Grammy Song of the Year award
and Billboard’s Year-End Top 100. I was hoping to see that Spotify
AFTS would be distinct enough to determine Grammy status and Billboard
positioning. Since these sources are independent of each other, a
best-fit model was found for each. Of the 4 classification models that
were run to predict Grammy status, the SVM model had the highest AUC
value of ~0.59. The other three models, Random Forest, Nearest
Neighbors, and Lasso, all performed similarly; their AUC values did not
exceed 0.50. Overall, none of the models had a great fit, but the SVM
model did perform marginally better. When fit to the Grammy’s test
data, the SVM model did not perform as well as on the training set,
with an AUC of ~0.49. Of the 4 regression models that were run to
predict Billboard positions, the SVM model also had the lowest RMSE
value: ~29.346. As with the Grammy’s data, the other models, Random
Forest, Nearest Neighbors, and Lasso, performed similarly, with RMSE
values all around 30. When applied to the Billboard test data, the SVM
model returned an RMSE of ~29.33, very similar to the training data.
While the SVM models we selected performed similarly on the respective
test data, they did not perform well. First, looking at the Grammy’s
SVM model predictions on the test data, we were returned an AUC of
~0.49, which is essentially no better than random guessing. And as we
saw from the confusion matrix, the model only wanted to categorize
songs as ‘neither nominated nor won’. Looking at the Billboard SVM
model predictions on the test data, we got an RMSE of ~29.33. As seen
in the results table, most predictions ranged from 40 to 60 while the
true values ranged from 1 to 100. Despite being the best performing of
the models we fit, the SVM predictions were not ideal.
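The Grammy failure mode described above, where a classifier collapses onto the majority class, is easy to reproduce. Here is a minimal sketch using yardstick on a made-up set of predictions (the column names `status` and `.pred_class` are assumptions, not the project’s actual variable names):

```r
library(yardstick)
library(tibble)

# Toy stand-in for the Grammy test predictions: true status vs. predicted class.
# A degenerate model that predicts "neither" for every song still scores high
# accuracy whenever "neither" dominates the data -- which is why the confusion
# matrix and AUC are more informative here than accuracy alone.
grammy_preds <- tibble(
  status = factor(c("won", "nominated", rep("neither", 8)),
                  levels = c("won", "nominated", "neither")),
  .pred_class = factor(rep("neither", 10),
                       levels = c("won", "nominated", "neither"))
)

conf_mat(grammy_preds, truth = status, estimate = .pred_class)
accuracy(grammy_preds, truth = status, estimate = .pred_class)
```

In this toy example the model is never right about a winner or a nominee, yet its accuracy is 0.8, purely because 8 of the 10 songs are truly “neither”.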
Considering that we tested a variety of models and that none of them
performed very well, it may be fair to conclude that the AFTS are not
sufficient predictor variables for music popularity. I think there are
a few contributing factors. First, the way AFTS are measured may not
entirely capture the individuality of a song. Maybe the way a song
evokes emotion, or its technical qualities, cannot be captured entirely
by Spotify’s machine learning analysis that determines the AFTS. Music
theory, music production, and music cognition are highly complex topics
that contribute to the popularity of music: a simple measurement may
not accurately distinguish these concepts. Second, I suspect that
what’s popular in a given year could be a bit unpredictable. From our
EDA we found that some of the AFTS certainly vary from decade to
decade. However, within each decade the range of values had a wide
spread. So, controlling for year, we see that AFTS are perhaps somewhat
random. The more general notion I am alluding to is that what goes
viral (makes Billboard’s Top 100) or what is deemed critically
acclaimed (nominated for a Grammy) is unpredictable in a given year.
The landscape of music is fast-paced and constantly changing. The next
hit could be a sound we’ve never heard. Who knows? Finally, my last
reasoning is that the models we tested may not be complex enough for
this data set. While we tested a wide range of models, it is possible
that these could all be poor fits.
Exploring my love for music through machine learning has been
incredibly fun and rewarding. I loved working with the Spotify API, and
I am glad I got to create an app that could extract these features.
Furthermore, this exploration has sparked more questions. A
year-to-year or decade-to-decade comparison of AFTS would be an
interesting follow-up topic. There is no question that music has
evolved over time, and it would be interesting to explore whether AFTS
could predict what time period a song is from! Regardless of the
success of these models, I look forward to continuing my investigation
into the possible ways AFTS can be used.